How slow is distinct - 2nd - Mailing list pgsql-sql

From: Michael Contzen
Subject: How slow is distinct - 2nd
Date:
Msg-id: NFBBLKLMPMLOHOLPGAFPIEMFCAAA.Michael Contzen@Dohle.com
Responses: Re: How slow is distinct - 2nd
           Re: How slow is distinct - 2nd
List: pgsql-sql
Hello,

I posted some observations about the performance of Postgres a few weeks
ago. The problem with the poor performance of "select distinct" still exists,
but I have tried to work out some reproducible results in a less complicated
way than in my first postings.

'select distinct' performs about 30 times faster on Oracle than on
Postgres.

Well, only 'select distinct' is slow. All other parts of our application
perform quite well on Postgres (with really large amounts of data).


(All tests on Postgres 7.3b1 and Oracle 9.0.1.3.0.
Hardware: 4 * 1800 MHz Xeon, 3 * 100 GB IDE software RAID 0,
SuSE Linux, 2.4.18 SMP kernel)


Here is the table:

mc=# \d egal
     Table "public.egal"
 Column |  Type   | Modifiers
--------+---------+-----------
 i      | integer |

mc=# select count(*) from egal;
  count
---------
 7227744
(1 row)

mc=# select count(distinct i) from egal;
 count
-------
    67
(1 row)

(The integers have values between 0 and 99, are an extract from our
warehouse, and are distributed nearly randomly.)
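
For anyone who wants to reproduce this without our warehouse extract, a
table of roughly the same shape can be built like this (a sketch only, not
the original data; generate_series() does not exist in 7.3, but it works on
current PostgreSQL versions):

-- Sketch: builds a test table of similar shape in a fresh database.
-- floor(random() * 100) yields integers roughly uniform over 0..99.
CREATE TABLE egal (i integer);

INSERT INTO egal (i)
SELECT floor(random() * 100)::integer
FROM generate_series(1, 7227744);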

The last query - count(distinct) - takes 1m30s, while Oracle needs 0m7s
for the same query and data. Getting the distinct rows themselves and
measuring the time results in:

time echo "select distinct i from egal;"|psql >/dev/null

real    4m52.108s
user    0m0.000s
sys     0m0.010s

Here I don't understand the difference in performance between "select
distinct i" and "select count(distinct i)" - Oracle takes a constant 7s
for either query:

time echo "select distinct i from egal;"|sqlplus xxxx/yyyy >/dev/null

real    0m6.979s
user    0m0.020s
sys     0m0.030s
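
A side note for anyone comparing the two forms: the same result sets can
also be written with GROUP BY and a derived table (I have not timed these
variants; whether the planner treats them differently from DISTINCT is only
a guess):

-- Equivalent to "select distinct i from egal" (variant not timed above).
SELECT i FROM egal GROUP BY i;

-- Equivalent to "select count(distinct i) from egal".
SELECT count(*) FROM (SELECT DISTINCT i FROM egal) AS d;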

In the first case (count(distinct)) Postgres produces a temporary file
of 86 MB in pgsql_tmp; in the second case (select distinct) the temp file
grows to 260 MB, which is even larger than the table size of egal
(232 MB).
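
The temp files suggest the sort spills to disk, so raising sort_mem for the
session should shrink or avoid them (a sketch only, values picked
arbitrarily, and I have not verified how much it helps here):

-- sort_mem is given in kilobytes, so 65536 means 64 MB of sort memory
-- for this session before spilling to disk.
-- (Later PostgreSQL versions call this setting work_mem.)
SHOW sort_mem;
SET sort_mem = 65536;
SELECT DISTINCT i FROM egal;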

I think the planner does not have many choices for this query, so it comes up with:
mc=# explain select distinct (i) from egal;
                                QUERY PLAN
---------------------------------------------------------------------------
 Unique  (cost=1292118.29..1328257.01 rows=722774 width=4)
   ->  Sort  (cost=1292118.29..1310187.65 rows=7227744 width=4)
         Sort Key: i
         ->  Seq Scan on egal  (cost=0.00..104118.44 rows=7227744 width=4)
(4 rows)


Which looks similar to Oracle's plan:
QPLAN
---------------
  SORT UNIQUE
    TABLE ACCESS FULL EGAL

Our problem is that things get worse as the table size increases, as
described in my first mails (especially when the temp file is split
into parts). Reorganizing/normalizing the data is not a solution, because
our application is just for analyzing the data.

I tried out the trigger idea: inserting a value into a table with a
unique key only if it doesn't already exist. It works, but... select distinct
is a much faster solution.
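
For completeness, the trigger idea looks roughly like this - a sketch only,
with a made-up side table egal_distinct and function name, written in
today's dollar-quoted syntax (7.3 needs the single-quoted form), and not
safe against concurrent inserts of the same new value:

-- Sketch: side table meant to hold each value of egal.i exactly once.
CREATE TABLE egal_distinct (i integer PRIMARY KEY);

CREATE OR REPLACE FUNCTION egal_track_distinct() RETURNS trigger AS $$
BEGIN
    -- Insert the value only if it is not already known.
    IF NOT EXISTS (SELECT 1 FROM egal_distinct WHERE i = NEW.i) THEN
        INSERT INTO egal_distinct (i) VALUES (NEW.i);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER egal_track_distinct_trg
    BEFORE INSERT ON egal
    FOR EACH ROW EXECUTE PROCEDURE egal_track_distinct();

With that in place, "select i from egal_distinct" replaces the expensive
"select distinct i from egal", at the price of slowing down every insert.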

Another way I tried is a PL/Tcl function which stores all distinct
values in an associative array in memory - by far the fastest solution -
but returning the distinct values causes problems for me (I shouldn't
say it, but I write them to a file and "copy" them back). And it's not very
nice - not as nice as "select distinct" :)

Kind regards,

Michael Contzen


